Flexible Bayesian semiparametric mixed-effects model for skewed longitudinal data

Background In clinical trials and epidemiological research, mixed-effects models are commonly used to examine population-level and subject-specific trajectories of biomarkers over time. Despite their increasing popularity and application, the specification of these models necessitates a great deal of care when analysing longitudinal data with non-linear patterns and asymmetry. Parametric (linear) mixed-effect models may not capture these complexities flexibly and adequately. Additionally, assuming a Gaussian distribution for random effects and/or model errors may be overly restrictive, as it lacks robustness against deviations from symmetry. Methods This paper presents a semiparametric mixed-effects model with flexible distributions for complex longitudinal data in the Bayesian paradigm. The non-linear time effect on the longitudinal response was modelled using a spline approach. The multivariate skew-t distribution, which is a more flexible distribution, is utilized to relax the normality assumptions associated with both random-effects and model errors. Results To assess the effectiveness of the proposed methods in various model settings, simulation studies were conducted. We then applied these models on chronic kidney disease (CKD) data and assessed the relationship between covariates and estimated glomerular filtration rate (eGFR). First, we compared the proposed semiparametric partially linear mixed-effect (SPPLM) model with the fully parametric one (FPLM), and the results indicated that the SPPLM model outperformed the FPLM model. We then further compared four different SPPLM models, each assuming different distributions for the random effects and model errors. The model with a skew-t distribution exhibited a superior fit to the CKD data compared to the Gaussian model. The findings from the application revealed that hypertension, diabetes, and follow-up time had a substantial association with kidney function, specifically leading to a decrease in GFR estimates. Conclusions The application and simulation studies have demonstrated that our work has made a significant contribution towards a more robust and adaptable methodology for modeling intricate longitudinal data. We achieved this by proposing a semiparametric Bayesian modeling approach with a spline smoothing function and a skew-t distribution. Supplementary Information The online version contains supplementary material available at 10.1186/s12874-024-02164-y.


Introduction
Longitudinal data are present in numerous clinical and other follow-up studies that involve monitoring subjects over time to understand the impact of exposures, processes, or characteristics on outcomes.These studies involve tracking a group of subjects and recording data at different time points throughout the study duration.For example, one or more renal functional progress biomarkers (e.g., serum creatinine, albuminuria, glomerular filtration rate, and other biomarkers) of a chronic kidney disease (CKD) patient can be measured repeatedly until end-stage renal disease and/or other events of interest occur.
This research was driven by longitudinal data on CKD, a significant global health issue that affects approximately 500 million individuals worldwide [1].Around 80 percent of these CKD cases are found in low-and middle-income countries.A prevalence of approximately 35.52 percent of CKD was observed among people with diabetes in Ethiopia [2].To comprehend how CKD progresses within individuals and across populations, as well as to assess the impact of treatments over time, conducting longitudinal data analysis is necessary.
Longitudinal data can show a variety of features over time and across subjects in many real-world situations during follow-up studies.A suitable choice of methods for analysing such complex longitudinal data is therefore sought.The most popular method proposed is the linear mixed-effects (LME) model [3][4][5][6] with a Gaussian response.The generalized linear mixed-effects models [7][8][9] and non-linear mixed-effects models [10] have been also used as an extension of LME model.
Despite the increasing popularity of LME models in applications, the specification and statistical inference of these models may necessitate much attention when treating and analysing longitudinal data with many features.One of these features is that longitudinal data can exhibit nonlinear, irregular patterns over time, along with asymmetry.Thus, to model and analyse longitudinal data with this feature, LME (fully parametric) models may not be flexible enough.
Another feature is that, unlike linear models, mixed models make assumptions regarding the distribution of model errors as well as random-effects.In the literature, it is usually assumed that the model errors and/or random-effects follow a multivariate normal distribution.In practice, longitudinal data might exhibit asymmetric distributions, leading to biased statistical results [11,12].Because of this, employing a normal distribution for model errors may lack robustness against deviations from normality and may be too limited to accurately describe the among-and within-subject variability of longitudinal outcomes [13].Many previous studies suggest considering a more flexible distribution for model errors to make a valid statistical inference [13,14].There are different suggestions in the literature concerning the impact of misspecification of a random-effect distribution on parameter estimation and inference.For instance, Molenberghs and Verbeke [15] suggest that misspecification of the random-effects distribution can lead to biased parameter estimates in nonlinear and generalised linear mixed models; in linear mixed models, however, deviations from the normality assumption may have very little impact on parameter estimation.McCulloch and Neuhaus [16] considered a generalised linear mixed model using a maximum likelihood estimation technique to evaluate the misspecification of the distribution of a random effect.Their findings demonstrated the robustness of most aspects of statistical inferences to the normality of random effects.Other authors in the recent literature, however, suggest that future research should accept more flexible distributional assumption for random-effects in addition to model errors [17,18].As a result, skew distributions have recently been used in the literature to handle asymmetry and model longitudinal data more flexibly [18][19][20].
Thus, in this study, we propose a flexible Bayesian mixed-effects model in a semiparametric setting with a smoothing spline specification and skew distributions for longitudinal data with the aforementioned features.To assess the effectiveness of the proposed methods in various model specifications, simulation studies were conducted.Finally, the proposed model was applied to data on CKD.

Motivating CKD data and longitudinal outcome trajectories
This paper utilizes a dataset spanning eight years, from June 2014 to June 2022, in the context of chronic kidney disease (CKD).The CKD data was gathered from the University of Gondar Comprehensive Specialized Hospital in Ethiopia, primarily extracted from patients' profiles (or charts) and medical records.Only patients with three or more follow-ups are included in the analysis.The dataset encompasses repeatedly recorded renal function biomarkers, comorbidities, and baseline characteristics of 198 CKD patients.On average, the patients were approximately 55 years old, with 56.6% being male.Around one-third (34.4%) of the CKD patients in the study population had baseline hypertension.Furthermore, the baseline prevalence of diabetes among the CKD patients was determined to be 23.81%.
The estimated glomerular filtration rate (eGFR), which estimates the rate at which the kidneys filter blood, is utilized as a longitudinal response variable.Thus, the analysis of this study considered 1,425 eGFR measurements from 189 patients.The minimum, maximum and average number of measurements per patient were 3, 18 and 8, respectively.63.5% (120) patients had six and above number of measurements, and out of them 43% patients had ten and above measurements.Of the total 1,425 measurements, based on the National Kidney Foundation guidelines [21], 39.7% indicated CKD Stage 3 (moderate kidney disease), 32.9% indicated Stage 4 (severe kidney disease), and 14.6% indicated Stage 5 (end-stage renal disease).To accurately represent the diverse patterns of renal function progression and create an appropriate model, the analysis includes patients with an eGFR value below ninety.Figure 1 displays the eGFR profiles of patients with CKD.The figure depicts the presence of non-linear trajectories and a positively skewed distribution of eGFR over time.

Bayesian modelling The semiparametric mixed-effects longitudinal outcome model
In this paper, the longitudinal variable is denoted as y ij , which represents the value of the response eGFR for subject i at the j th time point t ij .The indices i and j range from 1 to m and 1 to m i respectively, indicating the total number of subjects and the number of measurements for each subject.Let x ij = (x 1ij , . . ., x pij ) T denotes a 1 × p vector of associated p covariates.Most previous studies on chronic kidney disease have taken a parametric approach, like utilizing linear mixedeffects models, to model the longitudinal response variable y ij and the associated covariates x ij .However, as demonstrated in the presentation of the motivating CKD data above, the outcome eGFR exhibits irregular (non-linear) trajectories over time.Therefore, this paper introduces a semiparametric mixed-effects model that considers the non-linear trajectories of y i by employing a spline approach.
where y i = (y i1 , . . ., y imi ) T represent the vector of longi- tudinal response variable, X i = (x 1i , . . ., x pi ) T denote the design matrix of fixed-effects, and H i = (h 1i , . . ., h qi ) T represent the design matrix of random-effects.β and ξ i represent parameter vectors that are associated with the covariates of fixed and random effects.In Eq. ( 1), the effect of measurement time t i = (t i1 , . . ., t imi ) T on the response y i is modelled using a non-parametric approach.This is achieved by employing a smoothing function N i (t i ) , which can be defined as follows: where U(t i ) and V i (t i ) represent unknown smoothing functions for the population and subject-specific variations of the longitudinal response y i due to time effects t i , respectively.The random vectors ξ i , V i (t i ), and ε i are assumed to be independent one another.
A regression spline method is utilized to specify the unknown functions U(t i ) and V i (t i ) in Eq. ( 2), and can be defined as a linear combination of spline basis functions, ki (t i ) = ( k (t ij ), . . ., k (t imi )) T and li (t i ) = ( l (t ij ), . . ., l (t imi )) T .Where. (1) Mathematically, the specification can be given by where η k = (η 0 , η 1 , . . ., η k−1 ) T and ϑ il = (ϑ i0 , ϑ i1 , . . . ,ϑ i(l−1) ) T are k × 1 and l × 1 parameter vectors of the fixed-effect spline basis k (t i ) and random spline basis effects l (t i ) , respectively.The B-spline, truncated power or natural cubic spline basis can be used to construct the bases ( k (t i ) and l (t i ) ) in (3).In this study, natural cubic spline with percentile-based knots is considered to approximate the bases.By using Eq. ( 3), model ( 1) can be rewritten as: be the fixed-effect (population) and random effects design matrices, respectively.Furthermore, let α T be the associated parameter vectors.
Then, model ( 1) can be reformulated as Most previous studies assumed a Gaussian distribution for the random-effects ϕ i (representing inter- subject variation of y i ) as well as for the model errors ǫ i (representing within-subject variation) due to its computational convenience.However, in this study, we considered multivariate skew-t distributions [22] for both ϕ i and ε i .That is, Where ST (.) is a skew- t distribution; ρ ϕ and ρ ε denote degrees of freedom; ϕ and ε = σ 2 ε I m i denote covariance matrices; and δ ϕ and δ ε = δ ε I m i are skewness vectors of the random-effects ϕ i and model errors ε i , respectively. (3)

Hierarchical reformulation of the model
The statistical inference from a semiparametric mixedeffects model with multivariate skew-t distributions using the likelihood approach can be computationally demanding.Hence, to overcome this challenge, we adopted the Bayesian approach, which offers computational efficiency.This approach not only reduces the computational load but also allows for more accurate parameter estimation by leveraging existing information (prior knowledge) for parameter estimation.By employing Markov Chain Monte Carlo (MCMC) algorithms, the Bayesian approach enables us to estimate the parameters more efficiently while obtaining posterior distributions that provide a comprehensive quantification of parameter uncertainty.
In order to carry out the MCMC, it is crucial to reformulate the model ( 5) by rep-resenting the skew-t distributions using the stochastic representation considered by [22] (See Appendix B).To achieve this, we introduced random vectors as well as random variables v ϕ and v ε to represent the skew-t distributions associated with the random effects ϕ i and model errors ε i , respec- tively.Consequently, we present the hierarchical reformulation of model ( 5) as follows: Where N b () , in general, stands for a multivariate nor- mal distribution with a dimension of b, and Ŵ() denotes a gamma distribution.
The hyperparameter matrices � α and D ϕ are assumed diagonal for convenient implementation.Then, the prior distribution of all the parameters, denoted as π( ) , can be defined as the product of the individual prior distributions of each parameter.
Suppose G = y i , Z i , W ϕ i , W ε i , v ϕ , v ε be the observed data.An approximation of the posterior density of Ω given G can be obtained as follows: where f (G| ) is the joint likelihood function of G given Ω and The Metropolis-Hastings algorithm within Gibbs sampler can be used to draw samples from the full conditional posterior distributions of the parameters and to estimate their posterior means and standard deviations.For all models, the Markov chain Monte Carlo (MCMC) procedure was implemented using Win-BUGS14 software, which simplifies the implementation of the MCMC algorithm by eliminating the need to derive full conditionals and specify the algorithm explicitly.

Model comparison and diagnostics checking
The specification and implementation of the proposed model in the Bayesian approach may require to conduct convergence diagnostic checks and thoroughly examine the distributional assumptions before drawing any statistical inferences about the parameters.Failure to (7) do so may result in biased estimates and invalid statistical inference.Thus, in this study, the Brooks-Gelman-Rubin (BGR) plot [23], trace plot, ACF plot and the Geweke's test of convergence are all used to evaluate convergence.After confirming convergence, we proceed to evaluate the effectiveness of the proposed semiparametric mixed-effects model ( 5) by exploring various distributional assumptions for the random-effects and model errors.This model comparison involves considering different distributional specifications and examining their performance in capturing the underlying characteristics of the data.These specifications are given below: MoSTST: A semiparametric partially linear mixedeffects model (SPPLMEM) with multivariate skew-t (ST) distributions for both the random-effects ϕ i and model errors ε i .
MoNST: An SPPLMEM with multivariate normal (N) distribution of ϕ i and ST-distribution of ε i .
MoSNSN: An SPPLMEM with both ϕ i and ε i follow multivariate skew-normal (SN) distributions.
MoNN: An SPPLMEM with multivariate normal (N) distributions of ϕ i and ε i .MoNN model is the stand- ard choice in longitudinal data analysis.
In order to evaluate the performance of the estimators and make comparisons between different models, we utilised several statistical measures.Additionally, to compare and select the best-fitting Bayesian model for the skewed longitudinal response, we employed the deviance information criterion, which takes into account both the goodness of fit and model complexity.however, the model with a skew-t distribution for both random effects and model errors (MoSTST) demonstrated superior performance.This suggests that incorporating skewness in modelling the longitudinal data and proposing a more flexible distributional assumption (skew distribution) allows for better capturing the inherent asymmetries and heavy tails present in the data, leading to more accurate estimates.Overall, these results emphasize the advantages of employing Bayesian SPMEM with skew distribution over the conventional Gaussian model, offering greater flexibility and improved performance in accurately modelling complex longitudinal data.

Results of the CKD data analysis
In this paper, we included diabetes and hypertension as binary covariates based on the real CKD dataset and three spline basis functions of time with four randomeffects to model and analyse the longitudinal response, the estimated glomerular filtration rate (eGFR).Accordingly, we reformulate the general semiparametric mixedeffects model ( 6) as follows: where the parameter vectors α, λ, ϕ i , and Φ(Time ij ) are as defined as in the simulation section study.In order to obtain an approximation of the spline bases, we considered two internal knots at 9 and 25 months and two boundary knots at 0 and 96 months.The locations of ( 12)  these knots were determined based on the quantiles of the distribution of observed measurement time points.We proceed to analyse the CKD data using the proposed model with varying distributional assumptions, and subsequently compare and interpret the results.We begin by initially comparing the performance of two models: the proposed semiparametric (partially linear) mixedeffects model (SPPLMEM) specified in Eq. ( 12), and a fully parametric (linear) mixed-effects model (FPLMEM) that assumes Gaussian distributions for both the random effects and model errors.The FPLMEM is specifically defined as follows: where Time ij denotes the observed measurement time of the longitudinal biomarkers for the i th subject at the j th visit.The results (Table 2) show that the estimates of some parameters become large from FPLMEM compared to SPPLMEM.For instance, the estimates of α 2 and σ 2 from SPPLMEM were − 6.38 and 60.19, while from FPLMEM they became − 7.58 and 72.30, respectively.In addition, in order to select the most suitable Bayesian model that accurately represents the CKD data, we also compute the deviance information criterion (DIC) [24].Our analysis reveals that the SPPLMEM gives a lower DIC value (DIC = 10, 290) in comparison to the FPLMEM (DIC = 10, 430).After selecting the SPPLMEM as the most suitable model that accurately represents the data, we proceed to further compare four different SPPLMEMs by taking into account different distributional specifications.For model errors and random-effects as described in the simulation study.We fitted four Bayesian semiparametric mixed-effects models to the CKD data.The MCMC setup, computations, and convergence diagnostic methods employed were identical to those described in the simulation study.Table 3. displays a summary of the data analysis results and estimates for the parameters (Par) obtained from the four models with different distributional specifications.
As shown in Table 3., the CKD data analysis results reveal that each model produces slightly varied yet statistically significant estimates of most of the parameters.When comparing the models, the findings reveal that the utilization of the 4th model (MoNN), which employs a multivariate normal distribution for random effects and model errors, may result in an overestimation of some of the parameters.Specifically, the population parametersα 1 , α 2 , α 3 , 1 , 2 , and 3 are prone to being overes- timated.Notably, as can be clearly seen, the estimated scale parameter (the variance) of model errors ( σ 2 ε ) is (13) significantly larger in MoNN compared to the other models.The 3rd model (MoSNSN) also gives larger parameter estimates (e.g., σ 2 ε ) compared to the first two models.Furthermore, the estimated skewness parameter of the outcome eGFR ( δ ε ) is significantly different from zero in the first three models: MoSTST, MoNST, and MoSNSN.Some of the skewness parameters of the random effects ( δ ϕ ) are also significantly different from zero in MoSTST and MoSNSN.Thus, the significantly different from zero positive estimates of δ ε and the subject-specific random intercept δ ϕ 1 confirm the presence of positive skewness in the longitudinal eGFR data.In other words, the non-zero estimates of the skewness parameters and relatively small estimates of the variances may indicate that the proposed Bayesian models with skew-t distribution of model errors and/or random effects (MoSTST and MoNST) fit the CKD data well.This is in line with the results of the simulation studies.
In general, the proposed models (MoSTST and MoNST) outperform and the standard MoNN.In particular, MoNST has been chosen as the best model for further in-depth interpretation and discussion of the results because it has a relatively small DIC value, despite the fact that both MoSTST and MoSNSN have some significant skewness parameter estimates for the random effects.As can be seen from the simulation studies, MoNST also has a lower DIC value.This finding, a mixed model (skewed in our case) with a normal distribution of random-effects, is consistent with the study [15].
The results of all models indicate that the variables examined in this study, namely hypertension, diabetes, and follow-up time (the spline bases), are statistically significant factors contributing to the decline of patients' kidney function.This is attributed to the negative and significant association between these covariates and the response variable, eGFR.In other words, it is evident that these covariates have a substantial association with the decrease in GFR estimates.For example, the diabetes coefficient ( α 2 = −6.66, 95% CI: [− 10.16, − 3.11]) from MoNST (the best-fitting model) can be interpreted as the eGFR value of a CKD patient with diabetes being reduced by 6.66 units compared to a CKD patient without diabetes, while holding the same covariates and random effects.Additionally, a hypertensive CKD patient is associated with a 4.44 unit lower eGFR value ( α 3 = −4.44 , 95% CI: [− 5.45, − 3.46]) compared to a non-hypertensive CKD patient, with the same covariates and random effects.

Discussion
In recent years, there has been a growing emphasis in the literature on effectively modeling longitudinal data with many features.This includes giving careful consideration to the functional forms of longitudinal markers and the assumptions made about the distribution of random effects and model errors.With this in mind, the main objective of this study was to develop a flexible Bayesian mixed-effects model that addresses the problems commonly observed in longitudinal CKD data, encompassing characteristics such as skewness, non-linear effects over time, and flexible distributions for both random effects and model errors.The ultimate goal was to establish a robust statistical methodology that enables accurate and reliable inference in complex longitudinal data analysis.
We therefore proposed a Bayesian semiparametric mixed-effects model for the longitudinal response eGFR that addresses the above issues.To capture the non-linear effects of time and the flexibility of eGFR, regression splines were employed in the model.Additionally, multivariate skew distributions were incorporated to account for skewness in eGFR and to relax the assumptions about its distribution.Simulation studies were first conducted to provide a comprehensive description and evaluation of the performance of the proposed model.
We applied the proposed model by analysing data on chronic kidney disease (CKD) and assessing the relationship between covariates and estimated glomerular filtration rate (eGFR).The model comparison process in this study involved two steps.Firstly, we compared the proposed semiparametric partially linear mixed-effect (SPPLM) model with the fully parametric one (FPLM), and our results indicated that the SPPLM model outperformed the FPLM model.In the second step, we further compared four different SPPLM models, each assuming different distributions for the random-effects and model errors.As described in the data analysis and results, the SPPLM models with skew-t distribution exhibited a superior fit to the CKD data in comparison to the Gaussian SPPLM model.
The findings from the application revealed that hypertension, diabetes, and follow-up time had a substantial association with kidney function, specifically leading to a decrease in eGFR.These factors were identified as important predictors and exhibited a negative correlation with kidney function.
Additionally, the results of this study imply that when dealing with longitudinal data characterized by the aforementioned features, it is useful to incorporate nonparametric smoothing functions (splines) to capture non-linear time-effects and utilize skew distributions for model errors and/or random effects.In particular, accounting for skewness in the longitudinal data analysis by utilizing a more flexible distribution, the skew-t distribution, is crucial to handle asymmetry in the data and get unbiased results.By doing so, we can obtain less biased results and draw valid statistical inferences.Additionally, employing a flexible distributional assumption for the random-effects can lead to a more accurate explanation of subject-specific variations.
Apart from the CKD follow-up data that served as motivation, our methodology has broader applicability in cases where the longitudinal data have similar characteristics and the fundamental model requirements (or settings) are satisfied.

Conclusion
In conclusion, we have proposed a semiparametric Bayesian modeling approach with flexible distributions for complex longitudinal data.The results of the simulation and application studies have demonstrated that our work has made a significant contribution towards a more robust and adaptable methodology for modeling intricate longitudinal data.We recommend paying special attention to the specifications of the functional forms of longitudinal biomarkers and distributional assumptions of model errors when modeling complex longitudinal data.

Fig. 1
Fig.1The trajectories and distribution of the outcome eGFR: (a) the line-plots of eGFR over time for some randomly selected patients, indicating non-linear patterns in the trajectories of eGFR; and (b) the histogram with density for all the patients, indicating that eGFR has a distribution that is skewed towards the left k

Table 1
Simulation Results: Parameter Estimates (Est) with True Value (TV), RB, RMS Error, CP, and DIC for Each Model

Table 2
Comparison of Parameter Estimates (PE) between the Proposed Semiparametric Mixed-Effects Model (SPPLMEM) and the Fully Parametric Mixed-Effects Model (FPLMEM)

Table 3
Summary results of CKD data analysis based on four Bayesian models with different distributional specifications EPM Estimated Posterior Mean, StD Standard Deviation, CI 95% Credible Interval